Integrating Learning and Planning

Model-Based Reinforcement Learning

Model-based reinforcement learning combines model learning with policy (or value-function) learning: the agent learns a model of the environment from experience and then plans with that model.

Model-Free RL: learn a value function (and/or policy) directly from real experience; no model is involved.

Model-Based RL: learn a model from real experience, then plan a value function (and/or policy) using the learned model.

```mermaid
graph TD
    A(value/policy) -->|acting| B(experience)
    B -->|model learning| C(model)
    C -->|planning| A
```

Advantages of Model-Based RL: the model can be learned efficiently by supervised learning methods, and the agent can reason about model uncertainty.

Disadvantages of Model-Based RL: the agent first learns a model and then constructs a value function from it, so there are two sources of approximation error.


What is the model?

A model $\mathcal{M}$ is a function that predicts the agent's next state and reward given the current state and action.

$MDP = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

We assume the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known. $\mathcal{P}$ is the transition probability function, $\mathcal{R}$ is the reward function, and $\gamma$ is the discount factor.

Model Learning

The objective is to learn the model $\mathcal{M}$ from experience. This is a supervised learning problem.

Learning $s, a \rightarrow r$ is a regression problem; learning $s, a \rightarrow s'$ is a density estimation problem.

A look-up table can be used to represent the model. For each state-action pair, the model stores the next state and reward.

$S_1 \times A_1 \rightarrow S_2 \times R_1$
$S_1 \times A_2 \rightarrow S_3 \times R_2$
$\vdots$

The table is updated from experience: a visit count $N(s, a)$ is maintained for each state-action pair, transition probabilities are estimated as empirical frequencies, and rewards as empirical means. However, the table can become very large, and it is not practical to store all state-action pairs.

Another problem is the sample count: if a state-action pair has been visited only a few times, its empirical estimates are unreliable, and planning with an inaccurate model biases the learned values and policy.
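The table-lookup model above can be sketched as follows (a minimal sketch; the class and variable names are my own, not from the source):

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Empirical model: counts transitions and averages rewards per (s, a)."""
    def __init__(self):
        self.n = defaultdict(int)                            # N(s, a): visit count
        self.trans = defaultdict(lambda: defaultdict(int))   # counts of (s, a) -> s'
        self.reward_sum = defaultdict(float)                 # total reward for (s, a)

    def update(self, s, a, r, s_next):
        """Record one real transition (s, a) -> (r, s')."""
        self.n[(s, a)] += 1
        self.trans[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r

    def sample(self, s, a):
        """Sample s' from the empirical frequencies and return the mean reward."""
        n = self.n[(s, a)]
        successors = list(self.trans[(s, a)].items())
        states = [sp for sp, _ in successors]
        probs = [count / n for _, count in successors]
        s_next = random.choices(states, weights=probs)[0]
        r = self.reward_sum[(s, a)] / n
        return r, s_next

model = TableLookupModel()
model.update("s1", "a1", 1.0, "s2")
model.update("s1", "a1", 1.0, "s2")
r, s_next = model.sample("s1", "a1")   # r = 1.0, s_next = "s2"
```

Density estimation here reduces to counting, and the regression target for the reward is just the running mean, which makes the two supervised problems explicit.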


Integrated Architectures

We consider two sources of experience: real experience, sampled from the environment, and simulated experience, sampled from the model.

Model-Free RL: learn the value function from real experience only.

Model-Based RL: learn the model from real experience; plan the value function from simulated experience.

Dyna: learn the model from real experience; learn and plan the value function from both real and simulated experience.

```mermaid
graph TD
    A(value/policy) -->|acting| B(experience)
    B -->|model-free learning| A
    B -->|model learning| C(model)
    C -->|planning| A
```
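The Dyna loop above (model-free learning, model learning, and planning in one cycle) can be sketched as Dyna-Q on a toy problem. The corridor environment and all hyperparameters are my own illustrative assumptions, not from the source:

```python
import random
from collections import defaultdict

# Hypothetical toy environment: a 5-state corridor; reaching the rightmost
# state yields reward 1 and ends the episode.
N_STATES, ACTIONS = 5, [0, 1]            # 0 = left, 1 = right

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = s_next == N_STATES - 1
    return (1.0 if done else 0.0), s_next, done

def dyna_q(episodes=30, n_planning=20, alpha=0.5, gamma=0.95, eps=0.1):
    q = defaultdict(float)               # Q(s, a)
    model = {}                           # deterministic model: (s, a) -> (r, s')
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action (ties broken randomly)
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: (q[(s, b)], random.random()))
            r, s_next, done = step(s, a)                     # acting (real experience)
            target = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (target - q[(s, a)])        # model-free learning
            model[(s, a)] = (r, s_next)                      # model learning
            for _ in range(n_planning):                      # planning (simulated)
                ps, pa = random.choice(list(model))
                pr, ps_next = model[(ps, pa)]
                ptarget = pr + gamma * max(q[(ps_next, b)] for b in ACTIONS)
                q[(ps, pa)] += alpha * (ptarget - q[(ps, pa)])
            s = s_next
    return q

random.seed(0)                           # fixed seed so the sketch is reproducible
q = dyna_q()
```

After training, "right" should dominate "left" at every non-terminal state; the planning updates replay the learned model so far fewer real steps are needed than with plain Q-learning.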

Forward Search

We have a model of the environment, so we can simulate experience from it. However, exhaustively simulating all possible action sequences is not practical. Look-ahead search instead builds a search tree with the current state $s_t$ at the root and selects the best action by solving the sub-MDP that starts from now.
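A simple simulation-based forward search can be sketched as plain Monte-Carlo search: evaluate each root action by the mean return of rollouts simulated from the model under a fixed (here random) simulation policy. The tiny two-state model is a made-up example of mine, not from the source:

```python
import random

def mc_search(s_t, model_step, actions, n_sims=100, depth=20, gamma=0.95):
    """Monte-Carlo search: pick the root action with the best mean simulated return."""
    def rollout(s, a):
        g, discount = 0.0, 1.0
        for _ in range(depth):
            r, s, done = model_step(s, a)    # simulate one step with the model
            g += discount * r
            discount *= gamma
            if done:
                break
            a = random.choice(actions)       # fixed random simulation policy
        return g

    q = {a: sum(rollout(s_t, a) for _ in range(n_sims)) / n_sims
         for a in actions}
    return max(q, key=q.get)

# Hypothetical learned model: action 1 reaches a rewarding terminal state,
# action 0 stays in place with no reward.
def model_step(s, a):
    if a == 1:
        return 1.0, "terminal", True
    return 0.0, s, False

best = mc_search(0, model_step, actions=[0, 1])   # -> 1
```

This treats the model as a black-box simulator, which is exactly what makes sample-based search applicable when $\mathcal{P}$ is only available through sampling.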

Advantages of MC Tree Search: highly selective, best-first search; evaluates states dynamically for the current state; uses sampling to break the curse of dimensionality; works with sample-based (black-box) models; computationally efficient, anytime, and parallelisable.

Advantages of TD Tree Search: TD (bootstrapping) backups reduce variance at the cost of some bias; TD search is usually more efficient than MC search, and TD($\lambda$) search can be much more efficient.


#MMI706 - Reinforcement Learning at METU